Abstract

This paper deals with the problem of asymptotically optimal detection of
changes in regime-switching stochastic models. We need to divide the whole
obtained sample of data into several sub-samples with observations belonging
to different states of a stochastic model with switching regimes. For this
purpose, the idea of reduction to a corresponding change-point detection problem
is used. Both univariate and multivariate switching models are considered. For
the univariate case, we begin with the study of binary mixtures of probabilistic
distributions. In theorems 1 and 2 we prove that type 1 and type 2 errors of
the proposed method converge to zero exponentially as the sample size tends to
infinity. In theorem 3 we prove that the proposed method is asymptotically
optimal by the rate of this convergence in the sense that the lower bound in the
a priori informational inequality is attained for our method. Several
generalizations to the case of multiple univariate mixtures of probabilistic
distributions are considered. For the multivariate case, we first study the general
problem of classification of the whole array of data into several sub-arrays of
observations from different regimes of a multivariate stochastic model with
switching states. Then we consider regression models with abnormal observations
and switching sets of regression coefficients. Results of a detailed Monte Carlo
study of the proposed method for different stochastic models with switching
regimes are presented.

1. Introduction 

In this paper the problem of the retrospective detection of changes in stochastic
models with switching regimes is considered. Our main goal is to propose
asymptotically optimal methods for detection and estimation of possible 'switches',
i.e. random and transitory departures from prevailing stationary regimes of observed
stochastic models.

First, let us mention previous important steps in this field. Models with switching
regimes have a long history in statistics (see, e.g., Lindgren (1978)). A simple
switching model with two regimes has the following form:

$$Y_t = X_t\beta_1 + u_{1t} \quad \text{for the 1st regime},$$
$$Y_t = X_t\beta_2 + u_{2t} \quad \text{for the 2nd regime}.$$

For models with endogenous switchings the usual estimation techniques for
regressions are not applicable. Goldfeld and Quandt (1973) proposed regression models
with Markov switchings. In these models the probabilities of sequential switchings are
supposed to be constant. Usually they are described by the matrix of probabilities of
switchings between different states.

Another modification of the regression models with Markov switchings was proposed
by Lee and Porter (1984). The following transition matrix was studied:

$$[p_{ij}]_{i,j=0,1}, \qquad p_{ij} = P\{I_t = j \mid I_{t-1} = i\}.$$

Lee and Porter (1984) consider an example with railway transport in the US in
1880-1886, which was influenced by a cartel agreement. The following regression
model was considered:

$$\log P_t = \beta_0 + \beta_1 X_t + \beta_2 I_t + u_t,$$

where $I_t = 0$ or $I_t = 1$ depending on whether 'price wars' took place in the given period.

Cosslett and Lee (1985) generalized the model of Lee and Porter to the case of
serially correlated errors $u_t$.

Many economic time series occasionally exhibit dramatic breaks in their behavior, 
associated with events such as financial crises (Jeanne and Mason, 2000; Cerra,
2005; Hamilton, 2005) or abrupt changes in government policy (Hamilton, 1988; Sims 
and Zha, 2004; Davig, 2004). Abrupt changes are also a prevalent feature of financial 
data and empirics of asset prices (Ang and Bekaert, 2003; Garcia, Luger, and Renault, 
2003; Dai, Singleton, and Wei, 2003). 

The functional form of the 'hidden Markov model' with switching states can be
written as follows:

$$y_t = c_{s_t} + \phi y_{t-1} + \epsilon_t, \qquad (i)$$

where $s_t$ is a random variable which takes the values $s_t = 1$ and $s_t = 2$ obeying a
two-state Markov chain law:

$$\Pr\{s_t = j \mid s_{t-1} = i, s_{t-2} = k, \dots, y_{t-1}, y_{t-2}, \dots\} = \Pr\{s_t = j \mid s_{t-1} = i\} = p_{ij}. \qquad (ii)$$
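
The two relations above can be illustrated by a short simulation. The following is only a minimal sketch under stated assumptions: the function name `simulate_markov_switching`, the Gaussian noise, the zero starting value and the coding of the states as 0 and 1 (instead of 1 and 2) are illustrative choices and are not taken from the cited literature.

```python
import numpy as np

def simulate_markov_switching(n, c, phi, p, sigma=1.0, seed=0):
    """Simulate y_t = c[s_t] + phi * y_{t-1} + e_t with a two-state chain s_t.

    p[i][j] is the probability of moving from state i to state j.
    States are coded 0 and 1 here (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    s = np.zeros(n, dtype=int)      # the chain starts in state 0
    y = np.zeros(n)
    y_prev = 0.0                    # assumed starting value of the series
    for t in range(n):
        if t > 0:
            # next state depends only on the previous state (Markov property)
            s[t] = rng.choice(2, p=p[s[t - 1]])
        y[t] = c[s[t]] + phi * y_prev + sigma * rng.standard_normal()
        y_prev = y[t]
    return y, s
```

For instance, `simulate_markov_switching(500, c=[0.0, 3.0], phi=0.5, p=[[0.95, 0.05], [0.10, 0.90]])` produces a series with occasional level shifts; setting `phi=0` gives the model without autoregressive elements.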

A model of the form (i)-(ii) with no autoregressive elements ($\phi = 0$) appears to
have been first analyzed by Lindgren (1978) and Baum et al. (1980). Specifications
that incorporate autoregressive elements date back in the speech recognition literature 
to Poritz (1982), Juang and Rabiner (1985), and Rabiner (1989). Markov-switching 
regressions were first introduced in econometrics by Goldfeld and Quandt (1973), the 
likelihood function for which was first calculated by Cosslett and Lee (1985). General 
characterizations of moment and stationarity conditions for Markov-switching processes 
can be found in Tjostheim (1986), Yang (2000), Timmermann (2000), and Francq and 
Zakoian (2001). 

A useful review of modern approaches to estimation in Markov-switching models
can be found in Hamilton (2005).

However, Markov chain modeling is far from the only way to describe dependent
observations statistically. Besides Markov models, we can mention martingale
and copula approaches to dealing with dependent data, as well as description
of statistical dependence via different coefficients of 'mixing'. All of these approaches
are interrelated and we must choose the most appropriate method for the problem
at hand. In this paper we choose the mixing paradigm for description of statistical
dependence.

Note that the $\psi$-mixing condition is imposed below in this paper in order to obtain
the exponential rate of convergence to zero for type 1 and type 2 error probabilities
(see theorems 1 and 2 below). An alternative is to assume the $\alpha$-mixing property,
which is always satisfied for aperiodic and irreducible countable-state Markov chains
(see Bradley (2005)). Then we can obtain the hyperbolic rate of convergence to zero
for type 1 and type 2 error probabilities. For the majority of practical applications,
it is enough to assume $r$-dependence (for a certain finite number of lags $r \ge 1$) of
observations and state variables. Then all proofs become much shorter.

Now let us mention some important problems which lead to stochastic models with
switching regimes.

Splitting mixtures of probabilistic distributions

In the simplest case we suppose that the d.f. of observations has the following form:

$$F(x) = (1-\epsilon)F_0(x) + \epsilon F_1(x),$$

where $F_0(x)$ is the d.f. of ordinary observations; $F_1(x)$ is the d.f. of abnormal
observations; $0 \le \epsilon < 1$ is the probability of obtaining an abnormal observation.

We need to test the hypothesis of statistical homogeneity (no abnormal
observations) of an obtained sample $X^N = \{x_1, x_2, \dots, x_N\}$. If this hypothesis is rejected then
we need to estimate the share of abnormal observations ($\epsilon$) in the sample and to classify
this sample into sub-samples of ordinary and abnormal observations.

Estimation for regression models with abnormal observations

The natural generalization of the previous model is the regression model with
abnormal observations

$$Y = X\beta + \epsilon,$$

where $Y$ is the $n \times 1$ vector of dependent observations; $X$ is the $n \times k$ matrix of
predictors; $\beta$ is the $k \times 1$ vector of regression coefficients; $\epsilon$ is the $n \times 1$ vector of random
noises with the d.f. of the following type:

$$f_\epsilon(x) = (1-\delta)f_0(x) + \delta f_1(x),$$

where $0 \le \delta < 1$ is the probability to obtain an abnormal observation; $f_0(x)$ is the
density function of ordinary observations; $f_1(x)$ is the density function of abnormal
observations. For example, in the model with Huber's contamination [Huber, 1985]:

$$f_0(\cdot) = \mathcal{N}(0, \sigma^2), \qquad f_1(\cdot) = \mathcal{N}(0, \Lambda^2).$$

Estimation for regression models with changing coefficients

Regression models with changing coefficients are another generalization of the
contamination model. Suppose a baseline model is described by a linear regression
of $Y$ on $X$, where the mechanism of a change is purely random: the vector of
regression coefficients is equal to

$\beta$ with the probability $1 - \epsilon$,
$\gamma$ with the probability $\epsilon$,

and $\beta \ne \gamma$.

We need again to test the hypothesis of statistical homogeneity of an obtained
sample and to divide this sample into sub-samples of ordinary and abnormal observations
if the homogeneity hypothesis is rejected.

The goal of this paper is to propose methods which can solve these problems
effectively. Theoretically, we deal with estimation of type 1 and type 2 errors in testing
the statistical homogeneity hypothesis and with estimation of the contamination
parameters in the case of rejecting this hypothesis. Practically, we propose procedures for
implementation of these methods for univariate and multivariate models.

The structure of this paper is as follows. First, we consider univariate models with 
switching effects. For binary mixtures of probabilistic distributions we prove theorem 
1 about exponential convergence to zero of type 1 error in classification (to detect 
switches for a statistically homogeneous sample) as the sample size tends to infinity;
theorem 2 about exponential convergence to zero of type 2 error (vice versa, to accept 
stationarity hypothesis for a sample with switches); and theorem 3 which establishes 
the lower bound for the error of classification for binary mixtures. From theorems 2 
and 3 we conclude that the proposed method is asymptotically optimal by the order 
of convergence to zero of the classification error. 

Different generalizations of the proposed method for the case of univariate models 
with multiple switching regimes and for multivariate models with switching regimes 
are considered. Results of a detailed Monte Carlo study of the proposed method for 
different stochastic models with switching regimes are presented. 

2. Univariate models 
2.1. Binary mixtures 

2.1.1. Problem statement and description of the detection/estimation 
method 

Suppose the d.f. of the observations is the binary mixture

$$f(x) = (1-\epsilon)f_0(x) + \epsilon f_0(x-h),$$

where $\epsilon$, $h$ are unknown.

The problem is to estimate the parameters $\epsilon$, $h$ by the sample $X^N = \{x_i\}_{i=1}^N$, where all
$x_i$ have the same d.f. $f(\cdot)$.

An ad hoc method of estimation of these parameters is as follows: ordinary and
'abnormal' observations are heuristically classified into two sub-samples and the estimate
$\hat\epsilon$ is computed as the ratio of the size of the sub-sample of abnormal observations to the
whole sample size. Clearly, this method is correct only for large values of $h$. However,
this idea of two sub-samples can be used in construction of more subtle methods of
estimation.

The estimation method is as follows:

1) From the initial sample $X^N$ compute the estimate of the mean value:

$$\theta_N = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

2) Fix the parameter $b > 0$ and classify observations as follows: if an observation
falls into the interval $(\theta_N - b, \theta_N + b)$, then we place it into the sub-sample of ordinary
observations, otherwise into the sub-sample of abnormal observations.

3) Then for each $b > 0$ we obtain the following decomposition of the sample $X^N$
into two sub-samples

$$X_1(b) = \{\tilde x_1, \tilde x_2, \dots, \tilde x_{N_1}\}, \quad |\tilde x_i - \theta_N| < b,$$

$$X_2(b) = \{\hat x_1, \hat x_2, \dots, \hat x_{N_2}\}, \quad |\hat x_i - \theta_N| \ge b.$$

Denote by $N_1 = N_1(b)$, $N_2 = N_2(b)$, $N = N_1 + N_2$ the sizes of the sub-samples $X_1$ and
$X_2$, respectively.

The parameter $b$ is chosen so that the sub-samples $X_1$ and $X_2$ are separated in the
best way. For this purpose, consider the following statistic:

$$\Psi_N(b) = \frac{1}{N^2}\Bigl(N_2\sum_{i=1}^{N_1}\tilde x_i - N_1\sum_{i=1}^{N_2}\hat x_i\Bigr).$$

4) Define the boundary $C > 0$ and compare it with the value $J = \max_b |\Psi_N(b)|$
on the set $b > 0$. If $J \le C$ then we accept the hypothesis $H_0$ about the absence of
abnormal observations; if, however, $J > C$ then the hypothesis $H_0$ is rejected and the
estimates of the parameters $\epsilon$ and $h$ are constructed. Note that our primary goal is
to separate ordinary and abnormal observations in the sample. Evidently, classification
errors must be small and therefore we have to require some kind of convergence of the
estimate $\hat\epsilon_N$ to its true value $\epsilon$.






5) If $J > C$ then define the number $b_N^*$:

$$b_N^* \in \arg\max_{b>0} |\Psi_N(b)|.$$

Then

$$\epsilon_N^* = N_2(b_N^*)/N, \qquad h_N^* = \theta_N/\epsilon_N^*$$

are the nonparametric estimates for $\epsilon$ and $h$, respectively.

In the general case for construction of unbiased and consistent estimates of $\epsilon$ and
$h$ we can use the following relationships:

$$\hat\epsilon_N \hat h_N = \theta_N.$$

We will show that, under some conditions, the estimates $\hat\epsilon_N$ and $\hat h_N$ tend almost
surely to the true values $\epsilon$ and $h$ as $N \to \infty$. The sub-sample of abnormal observations
is $X_2(b_N^*)$.
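
Steps 1) to 5) can be sketched in code as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the statistic is taken in the form $\Psi_N(b) = N^{-2}(N_2 \sum \tilde x_i - N_1 \sum \hat x_i)$, and the maximum over $b > 0$ is approximated on a finite, user-supplied grid `b_grid`, which is an assumption of the sketch.

```python
import numpy as np

def psi_statistic(x, theta, b):
    """Return (Psi_N(b), N2) for the split |x_i - theta| < b versus >= b."""
    x = np.asarray(x, dtype=float)
    ordinary = np.abs(x - theta) < b       # sub-sample X1(b)
    n1 = int(ordinary.sum())
    n2 = x.size - n1                       # size of X2(b)
    s1 = x[ordinary].sum()
    s2 = x[~ordinary].sum()
    return (n2 * s1 - n1 * s2) / x.size**2, n2

def detect_binary_mixture(x, threshold, b_grid):
    """Steps 1-5 with the maximum over b taken on a finite grid (assumption)."""
    x = np.asarray(x, dtype=float)
    theta = x.mean()                                        # step 1
    stats = [psi_statistic(x, theta, b) for b in b_grid]    # steps 2-3
    values = np.array([abs(v) for v, _ in stats])
    k = int(values.argmax())
    J = float(values[k])                                    # step 4
    if J <= threshold:
        return {"reject": False, "J": J}
    eps = stats[k][1] / x.size                              # step 5: N2(b*) / N
    h = theta / eps if eps > 0 else float("nan")            # theta_N / eps*
    return {"reject": True, "J": J, "b_star": float(b_grid[k]),
            "eps": float(eps), "h": float(h)}
```

A call such as `detect_binary_mixture(x, threshold=C, b_grid=np.linspace(0.1, 5.0, 50))` returns the decision together with the estimates; the grid range should cover the scale of the data.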

2.1.2. Main results 

Let us formulate the main assumptions.
A1. Mixing conditions

On the probability space $(\Omega, \mathcal{F}, P)$ let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two $\sigma$-algebras from $\mathcal{F}$.
Consider the following measure of dependence between $\mathcal{H}_1$ and $\mathcal{H}_2$:

$$\psi(\mathcal{H}_1, \mathcal{H}_2) = \sup_{A \in \mathcal{H}_1,\, B \in \mathcal{H}_2,\, P(A)P(B) \ne 0} \left| \frac{P(AB)}{P(A)P(B)} - 1 \right|.$$

Suppose $\{y_n\}$, $n \ge 1$ is a sequence of random variables defined on $(\Omega, \mathcal{F}, P)$. Denote
by $\mathcal{F}_s^t = \sigma\{y_i : s \le i \le t\}$, $1 \le s \le t \le \infty$ the minimal $\sigma$-algebra generated by the random
variables $y_i$, $s \le i \le t$. Define

$$\psi(n) = \sup_{t \ge 1} \psi(\mathcal{F}_1^t, \mathcal{F}_{t+n}^\infty).$$

We say that a random sequence $\{y_n\}$ satisfies the $\psi$-mixing condition if the function
$\psi(n)$ (which is also called the $\psi$-mixing coefficient) tends to zero as $n$ goes to infinity.

The $\psi$-mixing condition is satisfied in most practical cases. In particular, for a
Markov chain (not necessarily stationary), if $\psi(n) < 1$ for a certain $n$, then $\psi(k)$ goes
to zero at least exponentially as $k \to \infty$ (see Bradley, 2005, theorem 3.3).
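
As an illustration of the mixing coefficient, consider a stationary Markov chain with finitely many states. The sketch below rests on two assumptions that are not taken from the paper: for such a chain the coefficient is evaluated between the present state and the state $n$ steps ahead, and the supremum over events reduces to a maximum over pairs of states, so that $\psi(n) = \max_{i,j} |P^n_{ij}/\pi_j - 1|$, where $\pi$ is the stationary distribution. The function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P, iters=10_000):
    """Power iteration for pi = pi P (assumes an irreducible aperiodic chain)."""
    P = np.asarray(P, dtype=float)
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi / pi.sum()

def psi_coefficient(P, n):
    """max over (i, j) of |P^n[i, j] / pi[j] - 1| for a stationary chain."""
    P = np.asarray(P, dtype=float)
    pi = stationary_distribution(P)
    Pn = np.linalg.matrix_power(P, n)
    # broadcasting divides column j of P^n by pi[j]
    return float(np.max(np.abs(Pn / pi - 1.0)))
```

For the two-state chain with transition matrix `[[0.9, 0.1], [0.2, 0.8]]` the coefficient computed this way falls below 1 at $n = 2$ and keeps decreasing geometrically, in line with the remark on Markov chains above.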
A2. Cramer condition

We say that the sequence $\{y_n\}$ satisfies the uniform Cramer condition if there exists
$T > 0$ such that for each $i$, $E\exp(t y_i) < \infty$ for $|t| < T$.

For a centered sequence this condition is equivalent to the following (see Petrov,
1987): there exist $g > 0$, $H > 0$ such that

$$E e^{t y_n} \le e^{\frac{1}{2} g t^2}, \quad |t| < H,$$

for all $n = 1, 2, \dots$.

We assume that conditions A1 and A2 hold true everywhere in the paper.
For any $x > 0$ let us choose the number $\gamma(x)$ from the following condition:

$$\ln(1 + \gamma(x)) = \begin{cases} \dfrac{x^2}{4g}, & x \le gH, \\[4pt] \dfrac{xH}{4}, & x > gH, \end{cases}$$

where $g$, $H$ are taken from the uniform Cramer condition.

For the chosen $\gamma(x)$, let us find from the $\psi$-mixing condition an integer $\phi_0(x) \ge 1$
such that $\psi(l) \le \gamma(x)$ for $l \ge \phi_0(x)$.

Below we denote by $P_0$ ($E_0$), $P_\epsilon$ ($E_\epsilon$) the measure (mathematical expectation) of the
sequence $X^N$ under the condition $\epsilon = 0$ or $h = 0$ (no 'abnormal' observations) and
under the condition $\epsilon h \ne 0$, respectively.

In the following theorem the exponential upper estimate for type 1 error is obtained
for the proposed method.

Theorem 1.

Let $\epsilon = 0$. Suppose the d.f. $f_0(\cdot)$ is symmetric w.r.t. zero. Then for any $C > 0$ the
following estimate holds:

$$P_0\{\max_{b>0} |\Psi_N(b)| > C\} \le 4\phi_0(CN/2)\exp(-L(C)N),$$

where $L(C) = \min\Bigl(\dfrac{HC}{8\phi_0(CN/2)}, \dfrac{C^2}{16\phi_0^2(CN/2)\,g}\Bigr)$, and the constants $g$, $H$ are taken from
the uniform Cramer condition.

The proof of theorem 1 is given in the Appendix. 

Now consider characteristics of this method in the case $\epsilon h \ne 0$. Here we again assume
that $E_0 x_i = 0$, $i = 1, \dots, N$.

Put (for any fixed $\epsilon$, $h$)

$$r(b) = \int_{\epsilon h - b}^{\epsilon h + b} f(x)\,x\,dx, \qquad d(b) = \int_{\epsilon h - b}^{\epsilon h + b} f(x)\,dx,$$

$$\Psi(b) = r(b) - \epsilon h\,d(b)$$

and consider the equation

$$f(\epsilon h + b) = f(\epsilon h - b). \qquad (1)$$
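
The role of this equation can be seen by differentiating the limit function. The following short derivation is a sketch which assumes that $f$ is continuous (so that the integrals can be differentiated with respect to their limits) and uses $r(b)$ and $d(b)$ defined as the integrals of $f(x)\,x$ and $f(x)$ over $[\epsilon h - b, \epsilon h + b]$, with $\Psi(b) = r(b) - \epsilon h\,d(b)$:

```latex
\begin{aligned}
r'(b) &= (\epsilon h + b)\, f(\epsilon h + b) + (\epsilon h - b)\, f(\epsilon h - b), \\
d'(b) &= f(\epsilon h + b) + f(\epsilon h - b), \\
\Psi'(b) &= r'(b) - \epsilon h\, d'(b)
          = b \,\bigl[ f(\epsilon h + b) - f(\epsilon h - b) \bigr].
\end{aligned}
```

Hence, for $b > 0$, the stationary points of $\Psi(b)$ are exactly the roots of the equation $f(\epsilon h + b) = f(\epsilon h - b)$.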
In the following theorem type 2 error is studied. 



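Theorem 2.

Suppose all assumptions of theorem 1 are satisfied and there exists $r^* = \sup_b r(b)$.
Suppose also that $f_0(\cdot) \ne 0$ and is continuous. Then for $0 < C < \max_b |\Psi(b)|$ we have

1)

$$P_\epsilon\{\max_b |\Psi_N(b)| \le C\} \le 4\phi_0(CN/2 + r^*)\exp(-L(\delta)N),$$

where $\delta = \max_b |\Psi(b)| - C > 0$, $L(\delta) = \min\Bigl(\dfrac{\delta^2}{16\phi_0^2 g}, \dfrac{H\delta}{8\phi_0}\Bigr)$.

2) Suppose, moreover, that equation (1) has a unique root $b^*$ (for any fixed $\epsilon$, $h$).
Then

$$b_N^* \to b^* \quad P_\epsilon\text{-a.s. as } N \to \infty;$$

3) The estimates $\hat\epsilon_N$, $\hat h_N$ converge $P_\epsilon$-a.s. to the true values of the parameters $\epsilon$, $h$,
respectively, as $N \to \infty$.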

The proof of theorem 2 is given in the Appendix. 

2.1.3. Recommendations for the choice of the threshold C 

For practical applications of the results obtained above we need to know the
threshold $C$.

a) In order to compute this threshold, at least one training sample without
switchings is needed.

b) For this sample we compute the threshold $C$ from the following empirical formula
which follows from theorem 1:



N 

where $N$ is the sample size, $\sigma^2$ is the variance of $\phi_0$-dependent observations and $\alpha$ is
the 1st type error level.

In other words, we compute the variance $\sigma^2$ of observations and the integer $\phi_0$
(by the first zero lag of the autocorrelation function of the training sample). Then we
compute the threshold $C$.

Let us give one example which explains how to do it in practice.

Consider the following model (without switchings)

$$x(n) = \rho\,x(n-1) + \sigma\xi_n, \quad n = 1, \dots, N,$$

where $\xi_n$ are i.i.d. r.v.'s with the d.f. $\mathcal{N}(0, 1)$, and replace $\phi_0(\cdot)$ by $(1 - \rho)^{-1}$.

As a result, the following regression relationship for the threshold $C$ was obtained:

$$\log(C) = -0.9490 - 0.4729\log(N) + 1.0627\log(\sigma) - 0.6502\log(1 - \rho) - 0.2545\log(1 - \alpha). \qquad (2)$$

Note that $R^2 = 0.978$ for this relationship and its residuals are stationary at the
error level 5%. The elasticity coefficient for the factor $N$ is close to its theoretical value
$-1/2$. The calibration coefficient $\exp(-0.949) = 0.3871$ here depends on the Gaussian
d.f. of observations.

We have to note that in practice we need to calibrate the above formula for the
threshold $C$ using several homogeneous samples.
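
The regression relationship for the threshold is easy to evaluate numerically. The sketch below assumes that the logarithms are natural (which agrees with the calibration coefficient $\exp(-0.949)$ quoted above); the function name `empirical_threshold` is hypothetical, and the coefficients are those fitted for Gaussian observations, so they would need recalibration for other distributions.

```python
import math

def empirical_threshold(n, sigma, rho, alpha):
    """Threshold C from the fitted log-linear relationship (natural logs assumed).

    n     : sample size N
    sigma : noise standard deviation
    rho   : autoregression coefficient, must satisfy rho < 1
    alpha : level entering the formula through log(1 - alpha)
    """
    log_c = (-0.9490
             - 0.4729 * math.log(n)
             + 1.0627 * math.log(sigma)
             - 0.6502 * math.log(1.0 - rho)
             - 0.2545 * math.log(1.0 - alpha))
    return math.exp(log_c)
```

The value returned decreases roughly as $N^{-1/2}$ and grows with $\sigma$ and with $\rho$, which is the qualitative behaviour expected from theorem 1.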

Examples

The proposed method was tested in the following experiments.
In the first series of tests the following mixture model was studied:

$$f_\epsilon(x) = (1-\epsilon)f_0(x) + \epsilon f_0(x-h), \quad f_0(\cdot) = \mathcal{N}(0, 1), \quad 0 \le \epsilon < 1/2.$$

First, the critical thresholds of the decision statistic $\max_{b>0} |\Psi_N(b)|$ were computed.
For this purpose we use the above formula for the threshold $C$ for the values $\alpha = 0.95$,
$\rho = 0$, $\sigma = 1$. The threshold values $C$ in each experiment are presented in table 1.

Table 1.

N                 50      100     300     500     800     1000    1200    1500    2000
$\alpha = 0.95$   0.1681  0.1213  0.0710  0.0534  0.044   0.0380  0.037   0.034   0.029
$\alpha = 0.99$   0.1833  0.1410  0.0869  0.0666  0.050   0.0471  0.0390  0.038   0.035


In the second series of tests the threshold value for $\alpha = 0.95$ was chosen as the
critical threshold $C$ in experiments with non-homogeneous samples (for $\epsilon \ne 0$). For
different sample sizes in 5000 independent trials of each test, the estimate of type 2
error $w_2$ (i.e. the frequency of the event $\max_b |\Psi_N(b)| < C$ for $\epsilon > 0$) and the estimate
$\hat\epsilon$ of the parameter $\epsilon$ were computed. The results are presented in table 2.



Table 2.

$\epsilon = 0.1$   h = 2.0                            h = 1.5
N                  300     500     800     1000      800     1200    2000    3000
C                  0.0710  0.0534  0.044   0.038     0.044   0.037   0.029   0.022
$w_2$              0.26    0.15    0.05    0.02      0.62    0.42    0.16    0.03
$\hat\epsilon$     0.104   0.101   0.097   0.099     0.106   0.103   0.102   0.0985


2.1.4. Asymptotic optimality 

Now consider the question about the asymptotic optimality of the proposed method
in the class of all estimates of the parameter $\epsilon$. The a priori theoretical lower bound
for the estimation error of the parameter $\epsilon$ in the model with i.i.d. observations with
d.f. $f_\epsilon(x) = (1 - \epsilon)f_0(x) + \epsilon f_1(x)$ is given in the following theorem.

Theorem 3. Let $\mathcal{M}_N$ be the class of all estimates of the parameter $\epsilon$. Then for
any $0 < \delta < \epsilon$,

$$\liminf_{N \to \infty} \inf_{\hat\epsilon_N \in \mathcal{M}_N} \sup_{0 < \epsilon < 1/2} \frac{1}{N} \ln P_\epsilon\{|\hat\epsilon_N - \epsilon| > \delta\} \ge -\delta^2 J(\epsilon),$$

where $J(\epsilon) = \int \bigl[(f_0(x) - f_1(x))^2 / f_\epsilon(x)\bigr]\,dx$ is the generalized distance between the
densities $f_0(x)$ and $f_1(x)$, and $P_\epsilon$ is the measure corresponding to the density $f_\epsilon(x)$.

Proof. 

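Note that it suffices to consider consistent estimates of the parameter $\epsilon$ (for
non-consistent estimates the limit in the left hand side of the above inequality is equal to zero).
This class is not empty because of the method proposed in the paper.

Suppose $\hat\epsilon_N$ is any consistent estimate of $\epsilon$ and $0 < \delta < \delta'$. Consider the random
variable $\lambda_N = \lambda_N(x_1, \dots, x_N) = I\{|\hat\epsilon_N - \epsilon| > \delta\}$, where $I(A)$ is the indicator of the set
$A$.

Then for any $d > 0$:

$$P_\epsilon\{|\hat\epsilon_N - \epsilon| > \delta\} = E_\epsilon \lambda_N \ge E_\epsilon\bigl(\lambda_N I\{f(X^N, \epsilon + \delta')/f(X^N, \epsilon) < e^d\}\bigr),$$

where $f(X^N, \epsilon)$ is the likelihood function of the sample $X^N$ of observations with the
density function $f_\epsilon(x)$, i.e.

$$f(X^N, \epsilon) = \prod_{i=1}^{N} \bigl[(1-\epsilon)f_0(x_i) + \epsilon f_1(x_i)\bigr].$$

Further,

$$E_\epsilon\bigl(\lambda_N I\{f(X^N, \epsilon + \delta')/f(X^N, \epsilon) < e^d\}\bigr) \ge e^{-d}\bigl(P_{\epsilon+\delta'}\{|\hat\epsilon_N - \epsilon| > \delta\} - P_{\epsilon+\delta'}\{f(X^N, \epsilon + \delta')/f(X^N, \epsilon) \ge e^d\}\bigr).$$

Since $\hat\epsilon_N$ is a consistent estimate, $P_{\epsilon+\delta'}\{|\hat\epsilon_N - \epsilon| > \delta\} \to 1$ as $N \to \infty$.

Let us consider the probability $P_{\epsilon+\delta'}\{f(X^N, \epsilon + \delta')/f(X^N, \epsilon) > e^d\}$. We have

$$\ln \frac{f(X^N, \epsilon + \delta')}{f(X^N, \epsilon)} = \sum_{i=1}^{N} \ln\Bigl(1 + \delta'\,\frac{f_1(x_i) - f_0(x_i)}{f_\epsilon(x_i)}\Bigr).$$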

On the other hand, 



— wr~ = ' 1 — 'm — = ' 



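Therefore, choosing $d = N((\delta')^2 + \kappa)J(\epsilon)$, $\kappa = o((\delta')^2)$, we obtain

$$P_{\epsilon+\delta'}\{f(X^N, \epsilon + \delta')/f(X^N, \epsilon) > e^d\} \to 0 \quad \text{as } N \to \infty.$$

Thus,

$$P_\epsilon\{|\hat\epsilon_N - \epsilon| > \delta\} \ge (1 - o(1))\,e^{-N((\delta')^2 + \kappa)J(\epsilon)},$$

or

$$\liminf_{N \to \infty} \inf_{\hat\epsilon_N \in \mathcal{M}_N} \sup_{0 < \epsilon < 1/2} \frac{1}{N} \ln P_\epsilon\{|\hat\epsilon_N - \epsilon| > \delta\} \ge -\delta^2 J(\epsilon).$$

Theorem 3 is proved.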

Comparing the results of theorems 2 and 3 we conclude that the proposed method
is asymptotically optimal by the order of convergence of the estimates of the mixture
parameters to their true values.



2.1.5. Generalizations: non-symmetric distribution functions 
Results obtained in theorems 1 and 2 can be generalized to the case of
non-symmetric distribution functions. Suppose the d.f. $f_0(\cdot)$ is asymmetric w.r.t. zero.
Then we can modify the proposed method as follows.

1. From the initial sample $X^N = \{x_1, \dots, x_N\}$ compute the mean value
$\theta_N = \frac{1}{N}\sum_{i=1}^{N} x_i$ and the sample $Y^N = \{y_1, \dots, y_N\}$, $y_i = x_i - \theta_N$. Then we divide the
sample $Y^N$ into two sub-samples $I_1(b)$, $I_2(b)$ as follows:

$$I_1(b) = \{\tilde y_1, \dots, \tilde y_{N_1(b)}\}, \quad -\phi(b) \le \tilde y_i \le b,$$
$$I_2(b) = \{\hat y_1, \dots, \hat y_{N_2(b)}\}, \quad \hat y_i > b \ \text{or} \ \hat y_i < -\phi(b),$$

where the function $\phi(b)$ is defined from the following condition:

$$0 = \int_{-\phi(b)}^{b} y f_0(y)\,dy,$$

$f_0(y) = f_0(x - \theta_N)$, $N = N_1(b) + N_2(b)$ and $N_1(b)$, $N_2(b)$ are the sample sizes of $I_1(b)$, $I_2(b)$,
respectively.

2. As before we compute the statistic

$$\Psi_N(b) = \frac{1}{N^2}\Bigl(N_2(b)\sum_{i=1}^{N_1(b)}\tilde y_i - N_1(b)\sum_{i=1}^{N_2(b)}\hat y_i\Bigr).$$

3. Then the value $J = \max_b |\Psi_N(b)|$ is compared with the threshold $C$. If $J \le C$
then the hypothesis $H_0$ (no abnormal observations) is accepted; if, however, $J > C$
then the hypothesis $H_0$ is rejected and the estimate of the parameter $\epsilon$ is constructed.

4. For this purpose, define the value $b_N^*$:

$$b_N^* \in \arg\max_{b>0} |\Psi_N(b)|.$$

Then

$$\epsilon_N^* = N_2(b_N^*)/N.$$
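
The only new ingredient compared with the symmetric case is the function $\phi(b)$, which balances the first moment of $f_0$ over the interval $[-\phi(b), b]$. A minimal numerical sketch is given below. It assumes that the centered density $f_0$ is known and can be evaluated on a grid, uses a plain trapezoid rule and bisection, and introduces the hypothetical names `integral_y_f` and `phi_of_b`; the search bound `a_max` must be supplied by the user.

```python
import numpy as np

def integral_y_f(f0, lo, hi, n=4001):
    """Trapezoid approximation of the integral of y * f0(y) over [lo, hi]."""
    y = np.linspace(lo, hi, n)
    v = y * f0(y)
    return float((v.sum() - 0.5 * (v[0] + v[-1])) * (y[1] - y[0]))

def phi_of_b(f0, b, a_max, iters=60):
    """Solve integral_{-a}^{b} y f0(y) dy = 0 for a in (0, a_max) by bisection.

    The integral is positive at a = 0 and is assumed to be negative at a_max.
    """
    lo, hi = 0.0, a_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if integral_y_f(f0, -mid, b) > 0.0:
            lo = mid   # too little negative mass: extend the left end point
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a density symmetric about zero the procedure returns $\phi(b) = b$ up to numerical error, so the classification rule reduces to the symmetric interval used in the basic method.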



Consider application of this method for the study of the classic $\epsilon$-contamination
model:

$$f_\epsilon(\cdot) = (1-\epsilon)\mathcal{N}(\mu, \sigma^2) + \epsilon\mathcal{N}(\mu, \Lambda^2), \quad \Lambda^2 \gg \sigma^2, \quad 0 \le \epsilon < 1/2.$$

For this model, the method described above has the form:

1. From the sample of observations $X^N = \{x_1, \dots, x_N\}$ the mean value estimate
$\hat\mu = \sum_{i=1}^{N} x_i/N$ is computed.

2. The sequence $y_i = (x_i - \hat\mu)^2$, $i = 1, \dots, N$ and its empirical mean $\theta_N =
\sum_{i=1}^{N} y_i/N$ are computed.

3. Then for each $b \in [0, B_{\max}]$, where $B_{\max}$ is a certain a priori chosen maximal
value of the parameter $b$, the sample $Y^N = \{y_1, \dots, y_N\}$ is divided into two sub-samples
in the following way: for $\theta_N(1 - \phi(b)) \le y_i \le \theta_N(1 + b)$ put $\tilde y_i = y_i$ (the size of the
sub-sample $N_1 = N_1(b)$), otherwise put $\hat y_i = y_i$ (the size of the sub-sample $N_2 = N_2(b)$).
Here we choose the function $\phi(b)$ from the following condition:

$$\int_{\theta_N(1-\phi(b))}^{\theta_N(1+b)} y f_0(y)\,dy = 0,$$

where /o(-) = iV(0, (1 - e) V^). 
From here we obtain: 

m = 1 - 

4. For any $b \in [0, B_{\max}]$, the following statistic is computed:

$$\Psi_N(b) = \frac{1}{N^2}\Bigl(N_2\sum_{i=1}^{N_1}\tilde y_i - N_1\sum_{i=1}^{N_2}\hat y_i\Bigr),$$

where $N = N_1 + N_2$, $N_1 = N_1(b)$, $N_2 = N_2(b)$ are the sizes of the sub-samples of ordinary and
abnormal observations, respectively.

5. Then, as above, the threshold $C > 0$ is chosen and compared with the value
$J = \max_b |\Psi_N(b)|$. If $J \le C$ then the hypothesis $H_0$ (no abnormal observations) is
accepted; if, however, $J > C$ then the hypothesis $H_0$ is rejected and the estimate of
the parameter $\epsilon$ is constructed as follows.

Define the value $b_N^*$:

$$b_N^* \in \arg\max_{b>0} |\Psi_N(b)|.$$

Then

$$\epsilon_N^* = N_2(b_N^*)/N.$$

Remark. For estimation of the threshold, we use the approach described in 2.1.3.

In experiments the critical values of the statistic $\max_b |\Psi_N(b)|$ were computed. For
this purpose, as above, for homogeneous samples (for $\epsilon = 0$), $\alpha$-quantiles of the decision
statistic $\max_b |\Psi_N(b)|$ were computed ($\alpha = 0.95, 0.99$). The results obtained in 5000
trials of each test are presented in table 3.

Table 3.

N                 50      100     300     500     800     1000    1200    1500    2000
$\alpha = 0.95$   0.3031  0.2330  0.1570  0.1419  0.1252  0.1244  0.1146  0.1107  0.1075
$\alpha = 0.99$   0.3699  0.2862  0.1947  0.1543  0.1436  0.1331  0.1269  0.1190  0.1157




The quantile value for $\alpha = 0.95$ was chosen as the critical threshold $C$ in
experiments with non-homogeneous samples (for $\epsilon \ne 0$). For different sample sizes in 5000
independent trials of each test, the estimate of type 2 error $w_2$ (i.e. the frequency
of the event $\max_b |\Psi_N(b)| < C$ for $\epsilon > 0$) and the estimate $\hat\epsilon$ of the parameter $\epsilon$ were
computed. The results are presented in tables 4 and 5.



Table 4.

$\Lambda = 3.0$, $\epsilon = 0.05$
N                  300     500     800     1000
C                  0.1570  0.1419  0.1252  0.1244
$w_2$              0.27    0.15    0.06    0.04
$\hat\epsilon$     0.064   0.056   0.052   0.05





Table 5.

$\Lambda = 5.0$, $\epsilon = 0.01$
N                  1000    1200    1500    2000    3000
C                  0.1244  0.1146  0.1107  0.1075  0.1019
$w_2$              0.25    0.20    0.15    0.10    0.04
$\hat\epsilon$     0.0135  0.013   0.012   0.011   0.010




2.2. Multiple switchings 

Suppose we obtain the data $X^N = \{x_1, \dots, x_N\}$, where the d.f. of an observation
$x_i$ can be written as follows:

$$f(x_i) = (1 - \epsilon_1 - \dots - \epsilon_k) f_0(x_i - h_1) + \epsilon_1 f_0(x_i - h_2) + \dots + \epsilon_k f_0(x_i - h_{k+1}),$$

where $\epsilon_1 \ge \epsilon_2 \ge \dots \ge \epsilon_k \ge 0$, $0 \le \epsilon_1 + \dots + \epsilon_k < 1$, $|h_1| < |h_2| < \dots < |h_{k+1}|$.

Suppose that the d.f. $f_0(x)$ is symmetric w.r.t. $x = 0$ and
$\min_{1 \le i \le k} (|h_{i+1}| - |h_i|) \ge B > 0$.

Our goal is to test the hypothesis $\epsilon_s = 0$, $s = 1, \dots, k$ (no switches) and, in case this
hypothesis is rejected, to estimate the number of switches $k \ge 1$ and the parameters
of the model $\epsilon_i$, $i = 1, \dots, k$, and $h_j$, $j = 1, \dots, k + 1$. In this section we denote by
$E_i$, $i = 0, 1, \dots, k$, the mathematical expectation of random variables corresponding to
the d.f. with shift $h_i$ ($h_0 := 0$).

This model has the following meaning. In the case of a binary switching we have
ordinary and abnormal observations. In the case of multiple switchings abnormal
observations are from different classes. The idea to use the sample mean as a reference
point of the method described above is no longer valid, because in the case of many classes
it can be greatly biased towards the maximal $|h_i|$. Instead, we use the reference points
from the histogram of the obtained sample. Specifically, we proceed as follows.

1. Construct the histogram $hist_N(t)$ of data by the whole sample obtained.
Find $\arg\max_t hist_N(t)$. An arbitrary point from this set is assumed to be the reference
point $\theta_N$ used in the following algorithm for a binary switching model.

1.1. Fix the parameter $b > 0$ and classify observations as follows: if an observation
falls into the interval $(\theta_N - b, \theta_N + b)$, then we place it into the sub-sample of ordinary
observations, otherwise into the sub-sample of abnormal observations.

1.2. Then for each $b > 0$ we obtain the following decomposition of the sample $X^N$
into two sub-samples

$$X_1(b) = \{\tilde x_1, \tilde x_2, \dots, \tilde x_{N_1}\}, \quad |\tilde x_i - \theta_N| < b,$$
$$X_2(b) = \{\hat x_1, \hat x_2, \dots, \hat x_{N_2}\}, \quad |\hat x_i - \theta_N| \ge b.$$

Denote by $N_1 = N_1(b)$, $N_2 = N_2(b)$, $N = N_1 + N_2$ the sizes of the sub-samples $X_1$ and
$X_2$, respectively.

The parameter $b$ is chosen so that the sub-samples $X_1$ and $X_2$ are separated in the
best way. For this purpose, consider the following statistic:

$$\Psi_N(b) = \frac{1}{N^2}\Bigl(N_2\sum_{i=1}^{N_1}\tilde x_i - N_1\sum_{i=1}^{N_2}\hat x_i\Bigr).$$

1.3. Define the boundary $C > 0$ and compare it with the value $J = \max_b |\Psi_N(b)|$
on the set $0 < b < B$. If $J \le C$ then we accept the hypothesis $H_0$ about the absence
of abnormal observations; if, however, $J > C$ then the hypothesis $H_0$ is rejected and
the estimates of the parameters $\epsilon := (\epsilon_1 + \dots + \epsilon_k)$ and $h_1$ are constructed. Note
that our primary goal is to separate ordinary and abnormal observations in the sample.
Evidently, classification errors must be small and therefore we have to require some
kind of convergence of the estimate $\hat\epsilon_N$ to its true value $\epsilon$.

1.4. Define the number $b_N^*$:

$$b_N^* \in \arg\max_{0 < b < B} |\Psi_N(b)|.$$

Then

$$\hat\epsilon_N = N_2(b_N^*)/N.$$

2. As a result, we obtain two classes of observations at the first step (ordinary and
abnormal data) and the estimate $\hat\epsilon_N$ of the sum $\epsilon = (\epsilon_1 + \dots + \epsilon_k)$, as well as the
estimate of the average $E_0 x_i$.

3. Then we remove all found 'ordinary' observations from the sample and repeat
steps 1 and 2. As a result, we obtain the estimate $\hat\epsilon_1$ of the parameter $\epsilon_1$, as well as
the estimate of the average $E_1 x_i$.

4. So we proceed further until a sub-sample without switches is obtained (i.e. the
decision threshold $C$ is not exceeded). As a result, we obtain the estimate $k_N$ of the
number of classes $k$, as well as the estimates of the parameters $\epsilon_1 \ge \dots \ge \epsilon_k \ge 0$ and
averages $E_0 x_i, E_1 x_i, \dots, E_k x_i$.
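
The recurrent procedure of steps 1 to 4 can be sketched as follows. This is an illustrative outline under several assumptions that are not fixed by the text: the reference point is taken as the centre of the fullest histogram bin, the maximum over $b$ is taken on a finite grid, the threshold is supplied as a function of the current sub-sample size, and each class is summarised by its share of the whole sample. All function names are hypothetical.

```python
import numpy as np

def histogram_mode(x, bins=20):
    """Reference point: centre of the fullest histogram bin (step 1)."""
    counts, edges = np.histogram(x, bins=bins)
    k = int(counts.argmax())
    return float(0.5 * (edges[k] + edges[k + 1]))

def best_split(x, theta, b_grid):
    """Maximise |Psi_N(b)| over the grid; returns (J, mask of abnormal points)."""
    n = x.size
    best_j, best_mask = 0.0, np.zeros(n, dtype=bool)
    for b in b_grid:
        abnormal = np.abs(x - theta) >= b
        n2 = int(abnormal.sum())
        n1 = n - n2
        psi = (n2 * x[~abnormal].sum() - n1 * x[abnormal].sum()) / n**2
        if abs(psi) > best_j:
            best_j, best_mask = abs(psi), abnormal
    return best_j, best_mask

def estimate_classes(x, threshold_fn, b_grid, max_classes=10):
    """Peel off one class per pass until the threshold is not exceeded."""
    rest = np.asarray(x, dtype=float)
    n_total = rest.size
    refs, shares = [], []
    for _ in range(max_classes):
        theta = histogram_mode(rest)
        j, abnormal = best_split(rest, theta, b_grid)
        refs.append(theta)
        if j <= threshold_fn(rest.size):
            shares.append(float(rest.size / n_total))  # remainder looks homogeneous
            break
        shares.append(float((rest.size - abnormal.sum()) / n_total))
        rest = rest[abnormal]       # step 3: drop the 'ordinary' observations
    return refs, shares
```

The length of `refs` plays the role of the estimated number of classes, and the reference points themselves serve as rough estimates of the shifts; in practice the number of histogram bins and the grid for $b$ would need tuning to the scale of the data.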

We see that this method is based upon reduction to the case of a binary switching
model. In this case we characterize the quality of a method by the performance criteria
of the right estimation of the number of classes (i.e. $k_N = k$) and the accuracy of
estimation (e.g., $\max_i |\hat\epsilon_i - \epsilon_i|$ in the case $k_N = k$). So we must use the following
performance criterion:

$$P_\epsilon\{(k_N \ne k) \cup ((\max_i |\hat\epsilon_i - \epsilon_i| > \delta) \cap (k_N = k))\}.$$

However, we see that the crucial thing is to correctly estimate the number of classes
$k$. The estimates of the parameters $h_1, \dots, h_{k+1}$ are assumed to be the reference points
at each step of the recurrent procedure described above. Then consistent estimates
of $\epsilon_i$ can be obtained by some of the standard methods (e.g., the method of moments).
Therefore we use the following performance criterion:

$$P_\epsilon\{k_N \ne k\}.$$

Note that the 1st type error for multiple switchings can be estimated like in
the binary case (we do not formulate this result). As to the 2nd type error (i.e. the
probability that we stop at the 1st step of the method because the decision threshold
is not exceeded), just observe that a binary switch is a particular case of the general
multiple switching situation (when all $\epsilon_i$ beginning from $i = 2$ are equal to zero).
Therefore

$$P_\epsilon\{\text{2nd type error, multiple switches}\} \le P_\epsilon\{\text{2nd type error, binary case}\} \le L_1\exp(-L(\delta)N),$$

for $0 < \delta < \max_{0 < b < B_{\max}} |\Psi(b)| - C$, where $L(\delta) = \min\Bigl(\dfrac{\delta^2}{16\phi_0^2 g}, \dfrac{H\delta}{8\phi_0}\Bigr)$, $L_1 = 4\phi_0$.

Now consider the event $\{k_N \ne k\} = \{k_N < k\} \cup \{k_N > k\}$.

The event $\{k_N < k\}$ means that at a certain recurrent step of the procedure described
above a sub-sample of remaining observations (after elimination of all previous
sub-samples) is considered to be "pure" (i.e. without switches) but in reality it contains
some more switches. The probability of this event is less than the 2nd type error at
this step of the procedure. Therefore,

$$P_\epsilon\{k_N < k\} \le L_1\exp(-L(\delta)N),$$

for $0 < \delta = \max_{0 < b < B_{\max}} |\Psi(b)| - C$, where $L(\delta) = \min\Bigl(\dfrac{\delta^2}{16\phi_0^2 g}, \dfrac{H\delta}{8\phi_0}\Bigr)$, $L_1 = 4\phi_0$.

The event $\{k_N > k\}$ means that finally more switches are detected in the
obtained sample than there are in reality. The probability of this event is less than the 1st type
error at the final step of the above recurrent procedure:

$$P_\epsilon\{k_N > k\} \le L_1\exp(-L(C)N),$$

where $L(C) = \min\Bigl(\dfrac{C^2}{16\phi_0^2 g}, \dfrac{HC}{8\phi_0}\Bigr)$, $L_1 = 4\phi_0$.

Therefore the following theorem holds.



Theorem 4. 

Suppose $0 < C < \max_{0<b<B_{max}} |\Psi(b)|$. Then the 2nd type error probability is estimated
from above as follows:

$$P_\epsilon\{\text{2nd type error}\} \le L_1 \exp(-L(\delta)N),$$

where $0 < \delta = \max_{0<b<B_{max}} |\Psi(b)| - C$, $L(\delta) = \min\left(\frac{\delta^2}{16\phi_0^2 g}, \frac{H\delta}{8\phi_0}\right)$, $L_1 = 4\phi_0$.

Moreover, the estimate of the number of switchings $k_N$ converges a.s. to the true
value of $k$ as $N \to \infty$ and

$$P_\epsilon\{k_N \ne k\} \le L_1\left(\exp(-L(\delta)N) + \exp(-L(C)N)\right),$$

where $0 < \delta = \max_{0<b<B_{max}} |\Psi(b)| - C$ and $L(C) = \min\left(\frac{C^2}{16\phi_0^2 g}, \frac{HC}{8\phi_0}\right)$.
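The bound of Theorem 4 can be evaluated numerically. The following is a minimal sketch, not part of the original study: the function names and the numerical values of $\delta$, $C$, $\phi_0$, $g$, $H$ are hypothetical and serve only to show how the right-hand side decays with $N$.

```python
import math


def rate(x, phi0, g, H):
    """L(x) = min(x^2 / (16 phi0^2 g), H x / (8 phi0)), as in Theorem 4."""
    return min(x ** 2 / (16.0 * phi0 ** 2 * g), H * x / (8.0 * phi0))


def class_count_error_bound(N, delta, C, phi0, g, H):
    """Right-hand side L1 * (exp(-L(delta) N) + exp(-L(C) N)) with L1 = 4 phi0."""
    L1 = 4.0 * phi0
    return L1 * (math.exp(-rate(delta, phi0, g, H) * N)
                 + math.exp(-rate(C, phi0, g, H) * N))


# Hypothetical constants, chosen only for illustration.
for N in (500, 2000, 8000):
    print(N, class_count_error_bound(N, delta=0.4, C=0.3, phi0=1.0, g=1.0, H=1.0))
```

The bound is informative only when it is below one, which for small $\delta$ or $C$ requires large $N$; this agrees with the asymptotic character of the theorem.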



Example 

Let us consider the following example. Suppose we have the model with three
classes of observations:

$$f(x_i) = (1 - \epsilon_1 - \epsilon_2) f_0(x_i - h_1) + \epsilon_1 f_0(x_i - h_2) + \epsilon_2 f_0(x_i - h_3), \quad i = 1, \dots, N,$$

where $f_0(\cdot) = \mathcal{N}(0, 1)$; $x_i$ are independent r.v.'s.

The problem is to estimate the unknown number of classes $k = 3$, parameters
$h_1, h_2, h_3$, and $\epsilon_1, \epsilon_2$ by the sample $X^N = \{x_1, \dots, x_N\}$.

Concretely, in this model the following parameters were chosen:

$\epsilon_1 = 0.3$; $\epsilon_2 = 0.15$

$h_1 = 1$, $h_2 = 3$, $h_3 = 7$.

For estimation of the decision threshold, the above empirical formula (2) can be
used:

log(C) = -0.9490 - 0.4729*log(N) + 1.0627*log(a) - 0.6502*log(1-p) - 0.2545*log(1-a).

Again remark that the elasticity coefficient for the factor N is close to its theoretical
value -0.5.
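The empirical formula can be turned into a small helper. This is a sketch under stated assumptions: the logarithm is taken as natural, and the three inputs besides N are passed as separate arguments with hypothetical names (`a`, `p`, `level`), since the symbols of formula (2) are defined earlier in the paper. Whatever the base of the logarithm, the dependence on N is the power law $C \propto N^{-0.4729}$, which is what the remark on the elasticity coefficient refers to.

```python
import math


def log_threshold(n, a, p, level):
    """log(C) from empirical formula (2); argument names are illustrative assumptions."""
    return (-0.9490
            - 0.4729 * math.log(n)
            + 1.0627 * math.log(a)
            - 0.6502 * math.log(1.0 - p)
            - 0.2545 * math.log(1.0 - level))


def threshold(n, a, p, level):
    """Decision threshold C (natural logarithm assumed)."""
    return math.exp(log_threshold(n, a, p, level))


# Quadrupling N scales C by 4 ** (-0.4729), i.e. roughly halves it.
ratio = threshold(400, 1.0, 0.0, 0.05) / threshold(100, 1.0, 0.0, 0.05)
print(ratio, 4 ** (-0.4729))
```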

In experiments we estimated the number of classes $k_N$ and the corresponding error
$er_N = P_\epsilon\{k_N \ne k\}$.

The following results were obtained (each cell of this table is the average in 1000
replications):

Table 6.

N       100     200     300     500     700     1000    1500
er_N    0.116   0.090   0.070   0.048   0.036   0.016   0.010
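The exponential decay of the error predicted by Theorem 4 can be checked roughly against Table 6. The sketch below fits a straight line to $\log(er_N)$ as a function of $N$ by least squares; the fit itself is an illustration added here, not a computation from the original study.

```python
import numpy as np

# Values of Table 6.
N = np.array([100, 200, 300, 500, 700, 1000, 1500], dtype=float)
er = np.array([0.116, 0.090, 0.070, 0.048, 0.036, 0.016, 0.010])

# Least-squares line log(er_N) = intercept + slope * N.
slope, intercept = np.polyfit(N, np.log(er), 1)
decay_rate = -slope  # empirical analogue of the exponent in exp(-L N)

print("estimated decay rate per observation:", decay_rate)
```

A positive fitted rate is consistent with an error of the form $L_1 \exp(-LN)$; seven points are of course too few for a precise estimate of the exponent.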
3. Multivariate models
3.1. Multivariate classification 
Binary mixtures 

Now let us consider the multivariate classification problem with binary mixtures. 
Suppose multivariate observations are of the following type: 

The multivariate density function of the vector $X^n$ is

$$f(X^n) = (1 - \epsilon) f_0(X^n) + \epsilon f_1(X^n),$$

where $f_0(\cdot)$, $f_1(\cdot)$ are the d.f.'s of ordinary and abnormal observations, respectively; the
d.f. $f_0(\cdot)$ is supposed to be symmetric w.r.t. its mean vector.

First, let us consider the case $E_1(X^n) = a \ne 0$, i.e. changes in mean of abnormal
observations. Remark that the baseline "change-in-mean" problem is usually considered
in many methods of 'cluster analysis' in which different distances between multivariate
'points' of characteristics (even without references to density functions and mathematical
expectations of observations) are considered.

The method can be formulated in analogy with the univariate case:

1) From the initial sample $X^N$ compute the estimate of the mean value:

$$\theta_N = \frac{1}{N}\sum_{i=1}^{N} X^i.$$

2) Fix the parameter $b > 0$ and classify observations as follows:

if $\|X^i - \theta_N\| \le b$, then we place $X^i$ into the sub-sample of ordinary observations;

if $\|X^i - \theta_N\| > b$, then we place $X^i$ into the sub-sample of abnormal observations.

As a result, for each $b > 0$ we obtain the decomposition of the sample $X^N$ into
sub-samples of ordinary and abnormal observations. Suppose the size of the ordinary
sub-sample is $N_1(b)$ and the size of the abnormal sub-sample is $N_2(b)$.

3) The parameter $b$ can be chosen in order to separate the sub-samples of ordinary
and abnormal observations ($\{\tilde X^i\}$ and $\{\hat X^i\}$, respectively) in the best way. For this
purpose, consider the following statistic:

$$\Psi_N(b) = \frac{1}{N^2}\left(N_2\sum_{i=1}^{N_1}\tilde X^i - N_1\sum_{i=1}^{N_2}\hat X^i\right).$$

4) Define the boundary $C > 0$ and compare it with the value $J = \max_b \|\Psi_N(b)\|$
on the set $b > 0$. If $J \le C$ then we accept the hypothesis $H_0$ about the absence of
abnormal observations; if, however, $J > C$ then the hypothesis $H_0$ is rejected and the
estimates of the parameters $\epsilon$ and $a$ are constructed.

Remark that our primary goal is to separate ordinary and abnormal observations
in the sample. Evidently, classification errors must be small and therefore we have to
require some kind of convergence of the estimate $\epsilon_N$ to its true value $\epsilon$.

5) Define the number $b^*_N$:

$$b^*_N \in \arg\max_{b>0} \|\Psi_N(b)\|.$$

Then

$$\epsilon^*_N = N_2(b^*_N)/N, \quad a^*_N = \theta_N/\epsilon^*_N$$

are the nonparametric estimates for $\epsilon$ and $a$, respectively.
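Steps 1) to 5) can be sketched in a few lines. The code below is an illustration under assumptions, not the implementation used in the study: the maximum over $b > 0$ is replaced by a maximum over a finite grid `b_grid`, the threshold `C` is supplied by the user, and the function name `binary_split` is hypothetical.

```python
import numpy as np


def binary_split(X, b_grid, C):
    """Binary classification of the rows of X (N x d) following steps 1)-5)."""
    N = X.shape[0]
    theta = X.mean(axis=0)                        # step 1: sample mean
    dist = np.linalg.norm(X - theta, axis=1)
    best_val, best_b = 0.0, None
    for b in b_grid:                              # steps 2-3: scan over b
        ordinary = dist <= b
        n1 = int(ordinary.sum())
        n2 = N - n1
        psi = (n2 * X[ordinary].sum(axis=0) - n1 * X[~ordinary].sum(axis=0)) / N ** 2
        val = float(np.linalg.norm(psi))
        if val > best_val:
            best_val, best_b = val, b
    if best_val <= C:                             # step 4: accept H0
        return {"J": best_val, "reject_H0": False}
    abnormal = dist > best_b                      # step 5: estimates
    eps = abnormal.sum() / N
    return {"J": best_val, "reject_H0": True, "b": best_b,
            "eps": float(eps), "a": theta / eps, "abnormal": abnormal}


rng = np.random.default_rng(0)
sample = np.vstack([rng.normal(size=(800, 2)),
                    rng.normal(size=(200, 2)) + np.array([10.0, 10.0])])
result = binary_split(sample, b_grid=np.arange(0.1, 15.0, 0.1), C=0.5)
print(result["reject_H0"], result["eps"], result["a"])
```

In this well separated synthetic sample the estimates are close to the true values; in general, as noted in the proof of theorem 2, the nonparametric estimates are asymptotically biased.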

Our main results in this case are analogous to the univariate situation. 

Theorem 5. 

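Suppose $\epsilon = 0$ and the d.f. $f_0(\cdot)$ is symmetric w.r.t. its mean vector. Then for any
$C > 0$ the following upper estimate for the probability of the 1st type error holds:

$$P_0\{\max_b \|\Psi_N(b)\| > C\} \le 4\phi_0(CN/2)\exp(-L(C)N),$$

where $L(C) = \min\left(\frac{HC}{8\phi_0(CN/2)}, \frac{C^2}{16\phi_0^2(CN/2)\,g}\right)$ and the constants $g$, $H$ are taken from
the uniform Cramer condition.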

For the 2nd type error we can formulate the following result. 

Theorem 6. 

Suppose all assumptions of theorem 5 are satisfied and there exists $r^* = \sup_b r(b)$.
Suppose also that $f(\cdot) \ne 0$ and is continuous. Then for $0 < C < \max_b \|\Psi(b)\|$ we have

1)

$$P_\epsilon\{\max_b \|\Psi_N(b)\| \le C\} \le 4\phi_0(CN/2 + r^*)\exp(-L(\delta)N),$$

where $\delta = \max_b \|\Psi(b)\| - C > 0$, $L(\delta) = \min\left(\frac{\delta^2}{16\phi_0^2 g}, \frac{H\delta}{8\phi_0}\right)$.

2) Suppose, moreover, that equation (*) has a unique root $b^*$. Then

$$b_N \to b^* \quad P_\epsilon\text{-a.s. as } N \to \infty.$$

This method deals with binary mixtures of multivariate d.f.'s. Its generalization 
to multiple classes of multivariate d.f.'s can be obtained in analogy with the previous 
section. 

Multiple switches 

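In this case the multivariate density function of the vector $X^n$ is

$$f(X^n) = (1 - \epsilon_1 - \dots - \epsilon_k) f_0(X^n - h_1) + \epsilon_1 f_0(X^n - h_2) + \dots + \epsilon_k f_0(X^n - h_{k+1}),$$

where $\epsilon_1 > \epsilon_2 > \dots > \epsilon_k > 0$, $0 < \epsilon_1 + \dots + \epsilon_k < 1$, $\|h_1\| < \|h_2\| < \dots < \|h_{k+1}\|$.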

Suppose that the d.f. fo{x) is symmetric w.r.t. x = and min — > 

i<'t<fc 

5 > 0. 

In order to estimate the number of classes $k$, as well as the parameters $\epsilon_i$, we do as
follows:

From the sample of initial multivariate observations

$$X^n = (x^n_1, \dots, x^n_d), \quad n = 1, \dots, N,$$

we build the sample of their Euclidean norms:

$$Y_n = \|X^n\| = \sqrt{(x^n_1)^2 + \dots + (x^n_d)^2}, \quad n = 1, \dots, N.$$

1. Construct the histogram $hist_N(t)$ of data by the whole sample $Y^N = (Y_1, \dots, Y_N)$.
Find $\arg\max_t hist_N(t)$. An arbitrary point from this set is assumed
to be the reference point $\theta_N$ used in the following algorithm for a binary switching
model.

1.1. Fix the parameter $b > 0$ and classify observations as follows:

if $|Y_i - \theta_N| \le b$, then we place $Y_i$ into the sub-sample of ordinary observations ($\tilde Y_i$);

if $|Y_i - \theta_N| > b$, then we place $Y_i$ into the sub-sample of abnormal observations ($\hat Y_i$).

1.2. Then for each $b > 0$ we obtain the following decomposition of the sample $Y^N$
into two sub-samples:

$$Y_1(b) = \{\tilde Y_1, \tilde Y_2, \dots, \tilde Y_{N_1}\}, \quad |\tilde Y_i - \theta_N| \le b,$$
$$Y_2(b) = \{\hat Y_1, \hat Y_2, \dots, \hat Y_{N_2}\}, \quad |\hat Y_i - \theta_N| > b.$$

Denote by $N_1 = N_1(b)$, $N_2 = N_2(b)$, $N = N_1 + N_2$ the sizes of the sub-samples $Y_1$ and
$Y_2$, respectively.

The parameter $b$ is chosen so that the sub-samples $Y_1(b)$ and $Y_2(b)$ are separated in
the best way. For this purpose, consider the following statistic:

$$\Psi_N(b) = \frac{1}{N^2}\left(N_2\sum_{i=1}^{N_1}\tilde Y_i - N_1\sum_{i=1}^{N_2}\hat Y_i\right).$$

1.3. Define the boundary $C > 0$ and compare it with the value $J = \max |\Psi_N(b)|$
on the set $0 < b < B$. If $J \le C$ then we accept the hypothesis $H_0$ about the absence
of abnormal observations; if, however, $J > C$ then the hypothesis $H_0$ is rejected and
the estimate of the parameter $\epsilon = \epsilon_1 + \dots + \epsilon_k$ is constructed. Remark that our primary
goal is to separate ordinary and abnormal observations in the sample. Evidently, classification
errors must be small and therefore we have to require some kind of convergence of the
estimate to its true value $\epsilon_1 + \dots + \epsilon_k$.

1.4. Define the number $b^*_N$:

$$b^*_N \in \arg\max_{0<b<B} \|\Psi_N(b)\|.$$

Then

$$\epsilon^*_N = N_2(b^*_N)/N.$$

2. As a result, we obtain two classes of observations at the first step (ordinary and
abnormal data) and the estimate of the sum $\epsilon_1 + \dots + \epsilon_k$.

3. Then we remove all found 'ordinary' observations from the sample and repeat
steps 1 and 2. As a result, we obtain the estimate $\hat\epsilon_1$ of the parameter $\epsilon_1$.

4. So we proceed further until a sub-sample without switches is obtained (i.e. the
decision threshold $C$ is not exceeded). As a result, we obtain the estimate $k_N$ of the
number of classes $k$, as well as the estimates of the parameters $\epsilon_1 > \dots > \epsilon_k > 0$.
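The recurrent procedure of steps 1 to 4 can be sketched as follows for a sample of norms. This is a simplified illustration under assumptions: the reference point is taken as the centre of the modal histogram bin, the set $0 < b < B$ is replaced by a finite grid, one fixed threshold `C` is used at every step, and the names `split_once` and `count_classes` are hypothetical.

```python
import numpy as np


def split_once(y, C, n_grid=200, bins=10):
    """One binary step: returns the mask of 'ordinary' points, or None if J <= C."""
    n = y.size
    counts, edges = np.histogram(y, bins=bins)
    j = int(np.argmax(counts))
    theta = 0.5 * (edges[j] + edges[j + 1])       # reference point from the modal bin
    dist = np.abs(y - theta)
    best_val, best_b = 0.0, None
    for b in np.linspace(0.0, dist.max(), n_grid + 2)[1:-1]:
        ordinary = dist <= b
        n1 = int(ordinary.sum())
        n2 = n - n1
        psi = (n2 * y[ordinary].sum() - n1 * y[~ordinary].sum()) / n ** 2
        if abs(psi) > best_val:
            best_val, best_b = abs(psi), b
    if best_val <= C:                             # threshold not exceeded: "pure" sub-sample
        return None
    return dist <= best_b


def count_classes(y, C, max_classes=20):
    """Estimate of the number of classes and of the abnormal shares at each step."""
    y = np.asarray(y, dtype=float)
    k, shares = 1, []
    while y.size > 1 and k < max_classes:
        ordinary = split_once(y, C)
        if ordinary is None:
            break
        shares.append(float((~ordinary).mean()))
        y = y[~ordinary]                          # remove the found 'ordinary' observations
        k += 1
    return k, shares


rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.5, 550),
                       rng.normal(10.0, 0.5, 300),
                       rng.normal(20.0, 0.5, 150)])
print(count_classes(data, C=0.5))
```

The synthetic clusters here are far apart, so the choice of `C` is not critical; for overlapping classes the threshold would have to be calibrated, for instance by an empirical formula of the kind used in the examples.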

Again we remark that the 1st type error for multiple switchings can be estimated
as in the binary case (we do not formulate this result). As to the 2nd type error
(i.e. the probability that we stop at the 1st step of the method because the decision
threshold is not exceeded), just observe that a binary switch is a particular case of the
general multiple switching situation (when all $\epsilon_i$ beginning from $i = 2$ are equal to
zero).

Therefore

$$P_\epsilon\{\text{2nd type error, multiple switches}\} \le P_\epsilon\{\text{2nd type error, binary case}\} \le L_1\exp(-L(\delta)N),$$

for $0 < \delta < \max_{0<b<B} \|\Psi(b)\| - C$, where $L(\delta) = \min\left(\frac{\delta^2}{16\phi_0^2 g}, \frac{H\delta}{8\phi_0}\right)$, $L_1 = 4\phi_0$.

Now consider the event $\{k_N \ne k\} = \{k_N < k\} \cup \{k_N > k\}$.

The event $\{k_N < k\}$ means that at a certain recurrent step of the above described
procedure a sub-sample of remaining observations (after elimination of all previous
sub-samples) is considered to be "pure" (i.e. without switches) but in reality it contains
some more switches. The probability of this event is less than the 2nd type error at
this step of the procedure. Therefore,

$$P_\epsilon\{k_N < k\} \le L_1\exp(-L(\delta)N),$$

for $0 < \delta = \max_{0<b<B} \|\Psi(b)\| - C$, where $L(\delta) = \min\left(\frac{\delta^2}{16\phi_0^2 g}, \frac{H\delta}{8\phi_0}\right)$, $L_1 = 4\phi_0$.

The event $\{k_N > k\}$ means that finally some more switches are detected in the
obtained sample than in reality. The probability of this event is less than the 1st type
error at the final step of the above recurrent procedure:

$$P_\epsilon\{k_N > k\} \le L_1\exp(-L(C)N),$$

where $L(C) = \min\left(\frac{C^2}{16\phi_0^2 g}, \frac{HC}{8\phi_0}\right)$, $L_1 = 4\phi_0$.

Therefore the following theorem holds.



Theorem 7. 

Suppose $0 < C < \max_{0<b<B} \|\Psi(b)\|$. Then the 2nd type error probability is estimated
from above as follows:

$$P_\epsilon\{\text{2nd type error}\} \le L_1\exp(-L(\delta)N),$$

where $0 < \delta = \max_{0<b<B} \|\Psi(b)\| - C$, $L(\delta) = \min\left(\frac{\delta^2}{16\phi_0^2 g}, \frac{H\delta}{8\phi_0}\right)$, $L_1 = 4\phi_0$.

Moreover, the estimate of the number of switchings $k_N$ converges a.s. to the true
value of $k$ as $N \to \infty$ and

$$P_\epsilon\{k_N \ne k\} \le L_1\left(\exp(-L(\delta)N) + \exp(-L(C)N)\right),$$

where $0 < \delta = \max_{0<b<B} \|\Psi(b)\| - C$ and $L(C) = \min\left(\frac{C^2}{16\phi_0^2 g}, \frac{HC}{8\phi_0}\right)$.



Example

Suppose we have the model with three classes of multivariate Gaussian observations:

$$f(x_i) = (1 - \epsilon_1 - \epsilon_2) f_0(x_i - h_1) + \epsilon_1 f_0(x_i - h_2) + \epsilon_2 f_0(x_i - h_3), \quad i = 1, \dots, N,$$

where $f_0(\cdot)$ is the multivariate Gaussian d.f. with the vector of means $h = (h_1, h_2)'$
and the covariance matrix

$$Cov(x_i) = \begin{pmatrix} 0.745 & -0.07 \\ -0.07 & 0.51 \end{pmatrix}.$$

The problem is to estimate the unknown number of classes $k = 3$, parameters
$h_1, h_2, h_3$, and $\epsilon_1, \epsilon_2$ by the sample $X^N = \{x_1, \dots, x_N\}$.

Concretely, in this model the following parameters were chosen:

$\epsilon_1 = 0.3$; $\epsilon_2 = 0.15$

$h_1 = (0\ 0)'$, $h_2 = (1\ 2)'$, $h_3 = (2\ 3)'$.

We take the norm of the vectors $x_i$ and so reduce this problem to the univariate
case considered earlier in this paper.
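The reduction to norms can be illustrated by simulating the three-class Gaussian model with the parameters listed above. The sketch assumes that $f_0$ is centred at zero, so that the class means are exactly the shifts; the function name `sample_norms` and the seed are hypothetical.

```python
import numpy as np

cov = np.array([[0.745, -0.07],
                [-0.07, 0.51]])
shifts = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 3.0]])    # class shifts
weights = np.array([1.0 - 0.3 - 0.15, 0.3, 0.15])           # 1 - eps1 - eps2, eps1, eps2


def sample_norms(n, rng):
    """Draw n observations from the mixture and return their Euclidean norms and labels."""
    labels = rng.choice(3, size=n, p=weights)
    x = rng.multivariate_normal(np.zeros(2), cov, size=n) + shifts[labels]
    return np.linalg.norm(x, axis=1), labels


rng = np.random.default_rng(1)
y, labels = sample_norms(1000, rng)
for c in range(3):
    print("class", c, "mean norm", y[labels == c].mean())
```

The class-wise mean norms are ordered in the same way as the norms of the shifts, which is the property the univariate procedure relies on. With these parameters the classes overlap noticeably, which may explain the large errors for small $N$ in Table 7.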

For estimation of the decision threshold the above formula (2) can be used:

log(C) = -0.9490 - 0.4729*log(N) + 1.0627*log(a) - 0.6502*log(1-p) - 0.2545*log(1-a).

Again we remark that the main problem is to estimate the number of classes
(estimation of $h_i$ and $\epsilon_i$ can be done with the help of some standard methods for the
given model structure).

In experiments we estimated the number of classes $k_N$ and the corresponding error
$er_N = P_\epsilon\{k_N \ne k\}$.

The following results were obtained (each cell of this table is the average in 1000
independent trials of the test):

Table 7.

N       100     200     300     500     700     1000    1500
er_N    0.991   0.910   0.707   0.189   0.049   0.020   0.004
3.2. Switching regressions

The following model of observations was considered:

$$y_i = X\beta_i + u_i = X(\zeta_i\beta_0 + (1 - \zeta_i)\beta_1) + u_i,$$

where

$y$ is an $N \times 1$ vector of dependent observations;

$X$ is an $N \times k$ matrix of predictors;

$u$ is an $N \times 1$ vector of centered random noises;

$\beta_i$ is a $k \times 1$ vector of model coefficients, $\zeta_i$ is a Bernoulli distributed r.v.
(independent of $u_i$) with two states: 1 with the probability $(1 - \epsilon)$ and 0 with the probability
$\epsilon$ for a certain unknown parameter $0 \le \epsilon < 1$. Here $\beta_0 \ne \beta_1$.

In other words, we suppose that regression coefficients of this model can change (switch)
from the level $\beta_0$ to $\beta_1$ and the mechanism of this change is purely random. We need
to test the hypothesis about the absence of switchings for each coefficient ($\epsilon = 0$) and,
in the case of rejection of this hypothesis, to construct the estimate of the parameter
$\epsilon > 0$.

For solving this problem, consider the OLS estimate of the vector $\beta_i$ (here and
below $'$ is the symbol of transposition):

$$\hat\beta_i = (X'X)^{-1}X'y_i = \zeta_i\beta_0 + (1 - \zeta_i)\beta_1 + (X'X)^{-1}X'u_i.$$

Since the sequence of noises u is centered, the problem is reduced to the above 
considered problem of detection of switches in the mean of an observed random vector. 
The matrix of predictors X influences only the random component. 
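The decomposition of the OLS estimate into the true coefficient vector plus a term depending only on the noise can be verified numerically. The sketch below uses a hypothetical design with an intercept and a linear trend and one fixed coefficient vector; it only illustrates the algebraic identity $(X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), np.arange(1, N + 1)])   # intercept and trend (assumed design)
beta = np.array([1.0, 2.0])                                # coefficient vector in force
u = rng.normal(size=N)                                     # centered noise
y = X @ beta + u

# OLS estimate and its random component, both via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
noise_part = np.linalg.solve(X.T @ X, X.T @ u)

print(beta_hat, beta + noise_part)
```

Since the noise is centered, the random component has zero mean, so the mean of the estimate switches together with the coefficient vector.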

Formally, we need to introduce the following vector $I = (1, 1, \dots, 1)$ ($N$ units) and
consider

$$\hat\beta = [\zeta\beta_0 + (1 - \zeta)\beta_1]\, I + (X'X)^{-1}X'u\, I.$$

Then the $(k \times N)$ matrix $\hat\beta$ consists of columns of $k \times 1$ vectors with means $\beta_0$
and $\beta_1$ changing in a purely random manner. Each component $j = 1, \dots, k$ of these
vectors $\hat\beta_i$, $i = 1, \dots, N$ is therefore a univariate random sequence

$$\hat\beta_i^j = [\zeta_i\beta_0^j + (1 - \zeta_i)\beta_1^j] + \xi_i^j, \quad i = 1, \dots, N,$$

where $\xi_i^j$ is the $j$-th component of the vector $(X'X)^{-1}X'u_i$.

So the problem of detection of changes in regression coefficients is reduced to the
above considered problem of detection of switches in the mean value of a univariate
random sequence. Remark that the uniform Cramer and the $\psi$-mixing conditions are still
satisfied for the process $\xi_i^j$, $i = 1, \dots, N$. As $E u_i = 0$ we get that there exist constants
$g_j > 0$, $H_j > 0$ such that

$$E e^{t\xi_i^j} \le e^{\frac{1}{2} g_j t^2}, \quad |t| \le H_j,$$

for all $i = 1, \dots, N$, $j = 1, \dots, k$. Moreover, we choose the number $m_0(\cdot)$ from the
$\psi$-mixing condition for $\xi_i^j$, $i = 1, \dots, N$: for any chosen number $\gamma(x) > 0$: $\psi(l) \le \gamma(x)$
for $l \ge m_0(x)$.

For testing the hypothesis of no switches we again consider the decision statistic
$\Psi_N(b)$ and compare the maximum of its absolute value with the decision threshold $C > 0$.
Then the following theorem holds:

Theorem 8.

Suppose $\epsilon = 0$, the d.f. of $u_i$ is symmetric w.r.t. zero and the $\psi$-mixing and the
uniform Cramer conditions for $\xi_i^j$, $i = 1, \dots, N$ are satisfied. Then for any threshold
$C > 0$ the following upper estimate for the 1st type error probability holds:

$$P_0\{\max_{b>0} |\Psi_N(b)| > C\} \le 4m_0(CN/2)\exp(-L(C)N),$$

where $L(C) = \min\left(\frac{HC}{8m_0(CN/2)}, \frac{C^2}{16m_0^2(CN/2)\,g}\right)$ and the constants $g$, $H$ are taken
from the uniform Cramer condition.

In order to consider the 2nd type error we just remark that the considered switching 
regression model is equivalent to the following specification of a model with the binary 
switches in mean: 

f^,{x) = (1 - e)/^.(x - ^i) + 6/ .(x - Pi). 
Denote y = — /3q 7^ and consider the value 

r^i{b)= j f^j{x)xdx. 

Then the following theorem holds. 
Theorem 9.

Suppose all assumptions of theorem 8 are satisfied and there exists $r^* = \sup_b r_{\beta_j}(b)$.
Suppose also that $f_{\beta_j}(\cdot) \ne 0$ and is continuous. Then for $0 < C < \max_b |\Psi_{\beta_j}(b)|$ we have

$$P_\epsilon\{\max_b |\Psi_N(b)| \le C\} \le 4m_0(CN/2 + r^*)\exp(-L(\delta)N),$$

where $\delta = \max_b |\Psi_{\beta_j}(b)| - C > 0$, $L(\delta) = \min\left(\frac{\delta^2}{16m_0^2 g_j}, \frac{H\delta}{8m_0}\right)$.

Example 

In the following example the regression model with one deterministic predictor was
considered:

$$y_i = c_1 + c_2 \cdot i + u_i, \quad u_i \sim N(0; 1), \quad i = 1, \dots, n.$$



Table 8. 



e ~ f/[0; 1] 

A, 



$\epsilon = 0.05$, $\beta_1 = [1; 1]$, $\beta_2 = [1; 2]$

N                     300     500     800     1000
C                     0.07    0.05    0.04    0.03
w_2                   0.87    0.59    0.14    0.004
estimate of epsilon   0.08    0.059   0.052   0.05


Table 9. 



ei < e < 1 
< e < ei 



$\epsilon = 0.1$, $\beta_1 = [1; 1]$, $\beta_2 = [1; 1.5]$

N                     300     500     800     1000
C                     0.07    0.05    0.04    0.03
w_2                   0.83    0.65    0.13    0.0
estimate of epsilon   0.15    0.12    0.102   0.10



Conclusion 

In this paper problems of the retrospective detection/estimation of 'abnormal'
observations were considered. A detection/estimation method was proposed. It was
proved that type 1 and type 2 errors of the proposed method converge to zero exponentially
as the sample size tends to infinity. The asymptotic optimality of the proposed
method follows from theorem 3. In this theorem the theoretical lower bound for the
error of estimation of the model's parameters was established. This bound is attained
for the proposed method (by the order of convergence to zero of the estimation error).
Proofs of theorems.



Proof of theorem 1. 

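First, let us prove the following inequality:

$$\max_{b>0} P_0\{|\Psi_N(b)| > C\} \le L_1\exp(-L_2(C)N),$$

where $L_1$, $L_2(C)$ are a positive constant and a positive function not depending on $N$.
For the statistic $\Psi_N(b)$ we can write:

$$\Psi_N(b) = \frac{1}{N^2}\left(N\sum_{i=1}^{N_1}\tilde x_i - N_1\sum_{i=1}^{N} x_i\right).$$

Then

$$P_0\{|\Psi_N(b)| > C\} \le P_0\left\{\left|\sum_{i=1}^{N_1}\tilde x_i\right| > \frac{C}{2}N\right\} + P_0\left\{\left|\sum_{i=1}^{N} x_i\right| > \frac{C}{2}N\right\}.$$

Further,

$$P_0\left\{\left|\sum_{j=1}^{N_1}\tilde x_j\right| > \frac{C}{2}N\right\} = P_0\left\{\sum_{j=1}^{N_1}\tilde x_j > \frac{C}{2}N\right\} + P_0\left\{\sum_{j=1}^{N_1}\tilde x_j < -\frac{C}{2}N\right\}.$$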

For any x>0, let us choose the number 7(x) from the following condition: 
ln(l + 7(x)) 

where g, H are taken from the uniform Cramer condition. 

For the chosen $\gamma(x)$, let us find such an integer $\phi_0(x) \ge 1$ from the $\psi$-mixing
condition that $\psi(l) \le \gamma(x)$ for $l \ge \phi_0(x)$. Take $x = CN/2$ and denote $\phi_0(CN/2) = \phi_0(\cdot)$,
$\gamma(CN/2) = \gamma(\cdot)$.

For some fixed $n$ denote $S_n = \sum_{i=1}^{n} x_i$ and estimate the probability $P_0\{S_n > CN/2\}$.

Consider the following decomposition of $S_n$ into groups of weakly dependent terms:

$$S_n = S_n^{(1)} + S_n^{(2)} + \dots + S_n^{(\phi_0(\cdot))},$$

$$S_n^{(i)} = x_i + x_{i+\phi_0} + \dots, \quad i = 1, 2, \dots, \phi_0(\cdot).$$

The number of terms within each group is no less than $[n/\phi_0(\cdot)]$ and no more than
$[n/\phi_0(\cdot)] + 1$ and the $\psi$-mixing coefficient between terms within each group is no more
than $\gamma(\cdot)$.
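The decomposition into groups of weakly dependent terms amounts to splitting the indices $1, \dots, n$ with stride $\phi_0$. A minimal sketch (the function name `stride_groups` and the numbers are hypothetical) shows the index sets and the group sizes stated above:

```python
def stride_groups(n, phi0):
    """Split indices 1..n into phi0 groups; group i holds i, i + phi0, i + 2*phi0, ..."""
    return [list(range(i, n + 1, phi0)) for i in range(1, phi0 + 1)]


groups = stride_groups(23, 5)
for g in groups:
    print(len(g), g)
```

Within a group, neighbouring indices are at distance $\phi_0$, which is why the $\psi$-mixing coefficient between its terms does not exceed $\gamma(\cdot)$.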
Then

$$P_0\left\{|S_n| > \frac{C}{2}N\right\} \le \phi_0(\cdot)\max_{1 \le i \le \phi_0(\cdot)} P_0\left\{|S_n^{(i)}| > \frac{CN}{2\phi_0(\cdot)}\right\}.$$

Consider $Z^{(i)} = \sum_{j=0}^{k} x(i + \phi_0(\cdot)j)$ and obtain the exponential upper estimate for
$P_0\{Z^{(i)} > x\}$, $\forall x > 0$.

In virtue of Chebyshev's inequality, we have

$$P_0\{Z^{(i)} > x\} \le e^{-tx}\, E_0 e^{tZ^{(i)}}, \quad \forall t > 0.$$

From 'i/'-mixing condition (see Ibragimov, Linnik, 1971) and choosing 7(-) we have 

Eq ' < (l + 7(-)) Eoexp(^^(^))Eexp(^^(i + 0o))•••Eoexp(^x(^ + 0o^))• 
Therefore, for < t < 



Eoexp(t5„) < (1 +7(-))'^ exp(^t2^iV). 



C fN^. 



Hence, 

Po{^n(x) > ^N} < M-) (1 + 7(-))" exp ^^{t'g - Ct/M'))) ■ 

Taking the maximum of the right hand w.r.t. < t < H and taking into account 
the choice of 7(-) we have 

exp(- ^ X ), 0<t <gH, 
exp(— ———-), t > gH 



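As this estimate does not depend on $n$, we get

$$\max_{b>0} P_0\{|\Psi_N(b)| > C\} \le 4\phi_0(\cdot)\exp(-L(C)N),$$

where

$$L(C) = \min\left(\frac{HC}{8\phi_0(\cdot)}, \frac{C^2}{16\phi_0^2(\cdot)\,g}\right).$$

Note that we obtained the uniform (w.r.t. the parameter $b$) exponential upper
estimate for the first type error. Therefore, the same upper estimate is valid for the
probability:

$$P_0\{\max_{b>0} |\Psi_N(b)| > C\} \le P_0\left\{\max_{b>0}\left|\sum_{i=1}^{N_1(b)}\tilde x_i\right| > \frac{C}{2}N\right\} + P_0\left\{\left|\sum_{i=1}^{N} x_i\right| > \frac{C}{2}N\right\}.$$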



i=\ i=\ 
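In fact, consider the r.v. $U_N(\omega) = \max_{b>0}\left|\sum_{i=1}^{N_1(b)}\tilde x_i\right|$ and define

$$\tau_N(\omega) = \min\left\{1 \le n \le N : \left|\sum_{i=1}^{n} x_i\right| = U_N\right\}.$$

Then

$$P_0\{U_N > CN/2\} = P_0\left\{\sum_{k=1}^{N}\left|\sum_{i=1}^{k} x_i\right| I\{\tau_N = k\} > CN/2\right\} \le P_0\left\{\left|\sum_{i=1}^{k_{max}(\omega)} x_i\right| > CN/2\right\},$$

where for any $\omega$: $\left|\sum_{i=1}^{k_{max}(\omega)} x_i\right| = \max_{1 \le k \le N}\left|\sum_{i=1}^{k} x_i\right|$.

As above, we obtain the uniform upper estimate for the probability
$P_0\left\{\left|\sum_{i=1}^{k} x_i\right| > CN/2\right\}$. Therefore,

$$P_0\{\max_{b>0} |\Psi_N(b)| > C\} \le 4\phi_0(\cdot)\exp(-L(C)N),$$

where

$$L(C) = \min\left(\frac{HC}{8\phi_0(\cdot)}, \frac{C^2}{16\phi_0^2(\cdot)\,g}\right).$$

Theorem 1 is proved.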

Proof of theorem 2. 

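Consider the main statistic:

$$\Psi_N(b) = \frac{1}{N^2}\left(N\sum_{i=1}^{N_1(b)}\tilde x_i - N_1(b)\sum_{i=1}^{N} x_i\right).$$

We have

$$\frac{1}{N}\sum_{i=1}^{N_1}\tilde x_i = \frac{1}{N}\sum_{n=1}^{N} x_n I(|x_n - \theta_N| \le b) \to \int_{\epsilon h - b}^{\epsilon h + b} f(x)\,x\,dx \quad \text{as } N \to \infty.$$

Here we used the relation

$$\frac{1}{N} E_\epsilon \sum_{k=1}^{N} I(|x_k - \theta_N| \le b) \to \int_{\epsilon h - b}^{\epsilon h + b} f(x)\,dx \quad \text{as } N \to \infty.$$

Therefore, using the latter relations, taking into account the law of large numbers and
the relation

$$E_\epsilon x_i = \epsilon h$$

we have

$$E_\epsilon \Psi_N(b) \to \Psi(b) \quad \text{as } N \to \infty,$$

where $\Psi(b) = r(b) - \epsilon h\, d(b)$.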
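For any $C > 0$ we can write:

$$P_\epsilon\{|\Psi_N(b) - \Psi(b)| > C\} \le P_\epsilon\left\{\left|\sum_{i=1}^{N_1(b)}\tilde x_i - N r(b)\right| > \frac{C}{2}N\right\} + P_\epsilon\left\{\left|\frac{N_1(b)}{N}\sum_{i=1}^{N} x_i - N\epsilon h\, d(b)\right| > \frac{C}{2}N\right\}. \quad (3)$$

Consider the first term in the right-hand side:

$$P_\epsilon\left\{\left|\sum_{i=1}^{N_1(b)}\tilde x_i - N r(b)\right| > \frac{C}{2}N\right\} = P_\epsilon\left\{\sum_{i=1}^{N_1(b)}\tilde x_i > \frac{C}{2}N + N r(b)\right\} + P_\epsilon\left\{\sum_{i=1}^{N_1(b)}\tilde x_i < -\frac{C}{2}N + N r(b)\right\}. \quad (4)$$

Analogously to theorem 1, we put $x = CN/2 + r(b)N$, find $\phi_0(x) = \phi_0(\cdot)$ corresponding
to this $x$, decompose the sum $\sum_{i=1}^{N_1}\tilde x_i$ into $\phi_0(\cdot)$ groups of weakly dependent components
and for each of these groups use Chebyshev's inequality.

Using considerations analogous to those in theorem 1, finally, for large enough $N$
we obtain: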

exp{- ), 0<t<gH, 

The second term in the right-hand side of (4) is estimated from above in the same way.

As to the second term in the right-hand side of (3), since $N_1(b) \le N$ for any $\omega$, we
obtain an analogous exponential upper estimate for it.

Again remark that we obtained the uniform (w.r.t. $b > 0$) exponential upper
estimate for the error probability. Therefore, as in theorem 1, we can prove the following
exponential estimate:

P,{max |v[/^(6) - ^{b)\ >C}< 40o(-) \ ^^pj^^^ 

exp(-— — ), C> gH 







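For type 2 error we can write:

$$P_\epsilon\{\max_b |\Psi_N(b)| \le C\} \le P_\epsilon\{\max_b |\Psi_N(b) - \Psi(b)| > \max_b |\Psi(b)| - C\} \le 4\phi_0(\cdot)\begin{cases} \exp\left(-\frac{N\delta^2}{16\phi_0^2(\cdot)\,g}\right), & 0 < \delta \le gH, \\ \exp\left(-\frac{NH\delta}{8\phi_0(\cdot)}\right), & \delta > gH, \end{cases}$$

where $\delta = \max_b |\Psi(b)| - C$.

This completes the proof of 1).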

As to the proof of 2), remark that the function $\Psi(b) = E_\epsilon \Psi_N(b)$ satisfies the reversed
Lipschitz condition in a neighborhood of $b^*$.

In fact, we have ^{b*) = 0, ^'(6*) = and = ifi^h + b*) - f{eh - b*)) + 

b*{f\eh + b*) - f\eh - b*)) = 2{b*)^ f"{u) ^ 0, where 0<u = u{b*) < b*. Therefore 
in a small neighborhood of b* we obtain: 

1^(6) - ^{b*)\ = {b*f \f" {u{b*))\{b - b*f > C{b - b*)\ 

for a certain C = C{b*) > 0. 

Now for any $0 < \kappa < 1$ consider the event $|b_N - b^*| > \kappa$. Then

Pe{\bN - b*\ > < P,{max |^jv(&^) - > I C^'} < 40o(-) exp(-L(C)iV), 

b Z 

where L[L) = mm 



From this inequality it follows that $b_N \to b^*$ $P_\epsilon$-a.s. as $N \to \infty$.
Then

$$\epsilon_N = N_2(b_N)/N, \quad h_N = \theta_N/\epsilon_N$$

are the nonparametric estimates for $\epsilon$ and $h$, respectively.

In general these estimates are asymptotically biased and non-consistent. For
construction of consistent estimates of $\epsilon$ and $h$, we need information about the d.f. $f_0(\cdot)$.
These consistent estimates can be obtained from the following system of equations:

$$\hat\epsilon_N \hat h_N = \theta_N,$$

$$\frac{1 - \hat\epsilon_N}{\hat\epsilon_N} = \frac{f_0(\theta_N - b_N - \hat h_N) - f_0(\theta_N + b_N - \hat h_N)}{f_0(\theta_N + b_N) - f_0(\theta_N - b_N)}.$$

The estimates $\hat\epsilon_N$ and $\hat h_N$ are connected with the estimate $b_N$ of the parameter $b^*$
via this system of deterministic algebraic equations. Therefore the rate of convergence
$\hat\epsilon_N \to \epsilon$ and $\hat h_N \to h$ is determined by the rate of convergence of $b_N$ to $b^*$ (which is
exponential w.r.t. $N$). So we conclude that $\hat\epsilon_N \to \epsilon$ and $\hat h_N \to h$ $P_\epsilon$-a.s. as $N \to \infty$.
Theorem 2 is proved.